Abstract

We address the problem of finding a "best" deterministic query answer to a query over a probabilistic database. 
For this purpose, we propose the notion of a consensus world (or a consensus answer) which is a deterministic 
world (answer) that minimizes the expected distance to the possible worlds (answers). This problem can be seen 
as a generalization of the well-studied inconsistent information aggregation problems (e.g. rank aggregation) to 
probabilistic databases. We consider this problem for various types of queries including SPJ queries, Top-k queries, 
group-by aggregate queries, and clustering. For different distance metrics, we obtain polynomial time optimal or 
approximation algorithms for computing the consensus answers (or prove NP-hardness). Most of our results are for a 
general probabilistic database model, called and/xor tree model, which significantly generalizes previous probabilistic 
database models like x-tuples and block-independent disjoint models, and is of independent interest. 

1 Introduction 

There is an increasing interest in uncertain and probabilistic databases arising in application domains such as
information retrieval [11, 35], recommendation systems [32, 33], mobile object data management [8], information
extraction [20], data integration [3] and sensor networks [13]. Supporting complex queries and decision-making on
probabilistic databases is significantly more difficult than in deterministic databases, and the key challenges include 
defining proper and intuitive semantics for queries over them, and developing efficient query processing algorithms. 

The common semantics in probabilistic databases are the "possible worlds" semantics, where a probabilistic 
database is considered to correspond to a probability distribution over a set of deterministic databases called "pos- 
sible worlds". Therefore, posing queries over such a probabilistic database generates a probability distribution over a 
set of deterministic results which we call "possible answers". However, a full list of possible answers together with 
their probabilities is not desirable in most cases since the size of the list could be exponentially large, and the proba- 
bility associated with each single answer is extremely small. One approach to addressing this issue is to "combine" 
the possible answers somehow to obtain a more compact representation of the result. For select-project-join queries, 
for instance, one proposed approach is to union all the possible answers, and compute the probability of each result 
tuple by adding the probabilities of all possible answers it belongs to [11]. This approach, however, cannot be easily
extended to other types of queries like ranking or aggregate queries.

Furthermore, from the user or application perspective, despite the probabilistic nature of the data, a single, deter- 
ministic query result would be desirable in most cases, on which further analysis or decision-making could be based. 
For SPJ queries, this is often achieved by "thresholding", i.e., returning only the result tuples with a sufficiently high 
probability of being true. For aggregate queries, often expected values are returned instead [24]. For ranking queries,
on the other hand, a range of different approaches have been proposed to find the true ranking of the tuples. These
include UTop-k, URank-k [37], probabilistic threshold Top-k function [22], Global Top-k [43], expected rank [9], and
so on. Although these definitions seem to reason about the ranking over probabilistic databases in some "natural" 
ways, there is a lack of a unified and systematic analysis framework to justify their semantics and to discriminate the 
usefulness of one from another. 

In this paper, we consider the problem of combining the results for all possible worlds in a systematic way by 
putting it in the context of inconsistent information aggregation which has been studied extensively in numerous 
contexts over the last half century. In our context, the set of different query answers returned from possible worlds can
be thought of as inconsistent information which we need to aggregate to obtain a single representative answer. To the best
of our knowledge, this connection between query processing in probabilistic databases and inconsistent information
aggregation, though natural, has never been realized before in any formal and mathematical way. Concretely, we
propose the notion of the consensus answer. Roughly speaking, the consensus answer is an answer that is closest to
the answers of the possible worlds in expectation. To measure the closeness of two answers $\tau_1$ and $\tau_2$, we have to
define a suitable distance function $d(\tau_1, \tau_2)$ over the answer space. For example, if an answer is a vector, we can simply
use the $L_2$ norm; whereas in other cases, for instance, Top-k queries, the definition of $d$ is more involved. If the most
consensus answer can be taken from any point in the answer space, we refer to it as the mean answer. A median answer is
defined similarly except that the median answer must be the answer of some possible world with non-zero probability.

From a mathematical perspective, if the distance function is properly defined to reflect the closeness of the answers, 
the most consensus answer is perhaps the best deterministic representative of the set of all possible answers since it
can be thought of as the centroid of the set of points corresponding to the possible answers. Our key results can be
summarized as follows: 

• (Probabilistic And/Xor Tree) We propose a new model for modeling correlations, called the probabilistic and/xor 
tree model, that can capture two types of correlations, mutual exclusion and coexistence. This model generalizes 
previous models such as x-tuples and the block-independent disjoint tuples model. More importantly, this model
admits an elegant generating functions based framework for many types of probability computations. 

• (Set Distance Metrics) We show that the mean and the median world can be found in polynomial time for the
symmetric difference metric for the and/xor tree model. For the Jaccard distance metric, we present a polynomial time
algorithm to compute the mean and median world for tuple-independent databases.

• (Top-k ranking Queries) The problem of aggregating inconsistent rankings has been well-studied under the name
of rank aggregation [14]. We develop polynomial time algorithms for computing mean and median Top-k answers
under the symmetric difference metric, and the mean answers under the intersection metric and generalized Spearman's
footrule distance [16], for the and/xor tree model.

• (Group-by Aggregates) For group-by count queries, we present a 4-approximation to the problem of finding a median
answer (finding mean answers is trivial).

• (Consensus Clustering) We also consider the consensus clustering problem for the and/xor tree model and get a 
constant approximation by extending a previous result [2].

Outline: We begin with a discussion of the related work (Section 2). We then define the probabilistic and/xor tree
model (Section 3), and present a generating functions-based method to do probability computations on them (Section
3.3). The bulk of our key results are presented in Sections 4 and 5, where we address the problem of finding consensus
worlds for different set distance metrics and for Top-k ranking queries respectively. We then briefly discuss finding
consensus worlds for group-by count aggregate queries and clustering queries in Section 6.

2 Related Work 

There has been much work on managing probabilistic, uncertain, incomplete, and/or fuzzy data in database systems 
and this area has received renewed attention in the last few years (see e.g. il23l 151 [28l |T9l im FTl [8l im l40l rT~8l ~). This 
work has spanned a range of issues from theoretical development of data models and data languages, to practical 
implementation issues such as indexing techniques. In terms of representation power, most of this work has either 
assumed independence between the tuples H21QTI], or has restricted the correlations that can be modeled f5!i28l[3l [34l . 
Several approaches for modeling complex correlations in probabilistic databases have also been proposed l35l HI [36] 

m. 

For efficient query evaluation over probabilistic databases, one of the key results is the dichotomy of conjunctive 
query evaluation on tuple-independent probabilistic databases by Dalvi and Suciu [11, 12]. Briefly, the result states
that the complexity of evaluating a conjunctive query over tuple-independent probabilistic databases is either PTIME
or #P-complete. For the former case, Dalvi and Suciu [11] also present an algorithm to find what are called safe query
plans, that permit correct extensional evaluation of the query. Unfortunately the problem of finding consensus answers 
appears to be much harder; this is because even if a query has a safe plan, the result tuples may still be arbitrarily 
correlated. 

In recent years, there has also been much work on efficiently answering different types of queries over probabilistic 
databases. Soliman et al. [37] first considered the problem of ranking over probabilistic databases, and proposed two
ranking functions to combine the tuple scores and probabilities. Yi et al. [41] presented improved algorithms for
the same ranking functions. Zhang and Chomicki [43] presented desiderata for ranking functions and proposed
Global Top-k queries. Ming Hua et al. [21, 22] recently presented a different ranking function called Probabilistic
threshold Top-k queries. Finally, Cormode et al. [9] also present a semantics of ranking functions and a new ranking
function called expected rank. In a recent work, we proposed a parameterized ranking function, and presented general
algorithms for evaluating it [29]. Other types of queries have also been recently considered over probabilistic
databases (e.g., clustering [10], nearest neighbors [6], etc.).

The problem of aggregating inconsistent information from different sources arises in numerous disciplines and has 
been studied in different contexts over decades. Specifically, the Rank-Aggregation problem aims at combining $k$
different complete ranked lists $\tau_1, \ldots, \tau_k$ on the same set of objects into a single ranking, which is the best description
of the combined preferences in the given lists. This problem was considered as early as the 18th century when Condorcet
and Borda proposed a voting system for elections [31, 25]. In the late 50's, Kemeny proposed the first mathematical
criterion for choosing the best ranking [26]. Namely, the Kemeny optimal aggregation $\tau$ is the ranking that minimizes
$\sum_{i=1}^{k} d(\tau, \tau_i)$, where $d(\tau_i, \tau_j)$ is the number of pairs of elements that are ranked in different order in $\tau_i$ and $\tau_j$ (also
called Kendall's tau distance). While computing the Kemeny optimal aggregation is shown to be NP-hard [15], a 2-approximation
can be easily achieved by picking the best one from the $k$ given ranking lists. The other well-known 2-approximation
is from the fact that the Spearman footrule distance, defined to be $d_F(\tau_i, \tau_j) = \sum_t |\tau_i(t) - \tau_j(t)|$, is within twice the
Kendall's tau distance and the footrule aggregation can be done optimally in polynomial time [14]. Ailon et al. [2]
improve the approximation ratio to 4/3. We refer the readers to [27] for a survey on this problem. For aggregating
Top-k answers, Ailon [1] recently obtained a 3/2-approximation based on rounding an LP solution.

The Consensus-Clustering problem asks for the best clustering of a set of elements which minimizes the number
of pairwise disagreements with the given $k$ clusterings. It is known to be NP-hard [42] and a 2-approximation
can also be obtained by picking the best one from the given $k$ clusterings. The best known approximation ratio is 4/3
due to Ailon et al. [2]. Recently Cormode et al. [10] proposed approximation algorithms for $k$-center and $k$-median
clustering problems under attribute-level uncertainty in probabilistic databases.

3 Preliminaries 

We begin with reviewing the possible worlds semantics, and introduce the probabilistic and/xor tree model. 

3.1 Possible World Semantics 

We consider probabilistic databases with both tuple-level uncertainty (the existence of a tuple is uncertain) and
attribute-level uncertainty (a tuple attribute value is uncertain). Specifically, we denote a probabilistic relation by
$R^P(K; A)$, where $K$ is the key attribute, and $A$ is the value attribute.$^1$ For a particular tuple in $R^P$, its key
attribute is certain and is sometimes called the possible worlds key. $R^P$ is assumed to correspond to a probability
space $(PW, \Pr)$ where the set of outcomes is a set of deterministic relations, which we call possible worlds,
$PW = \{pw_1, pw_2, \ldots, pw_n\}$. Note that two tuples cannot have the same value for the key attribute in a single
possible world. Because of the typically exponential size of $PW$, an explicit possible worlds representation is not feasible,
and hence the semantics are usually captured implicitly by probabilistic models with polynomial size specification.

Let $T$ denote the set of tuples in all possible worlds. For ease of notation, we will use $t \in pw$ in place of "$t$ appears
in the possible world $pw$", $\Pr(t)$ to denote $\Pr(t \text{ is present})$ and $\Pr(\neg t)$ to denote $\Pr(t \text{ is not present})$.

1 For clarity, we will assume singleton key and value attributes.
Figure 1: (i) The and/xor tree representation of a set of block-independent disjoint tuples; the generating function
obtained by assigning the same variable $x$ to all leaves gives us the distribution over the sizes of the possible worlds.
(ii) Example of a highly correlated probabilistic database with 3 possible worlds and (iii) the and/xor tree that captures
the correlation; the coefficient of $y$ (0.3) is $\Pr(r(t_3, 6) = 1)$ (i.e., the probability that that alternative of $t_3$ is ranked
at position 1).



Further, for a tuple $t^p \in R^P$, we call the certain tuples corresponding to it (with the same key value) in the union
of the possible worlds, its alternatives.

Block-Independent Disjoint (BID) Scheme: BID is one of the more popular models for probabilistic databases, and
assumes that different probabilistic tuples (with different key values) are independent of each other [11, 40, 12, 38].
Formally, a BID scheme has the relational schema of the form $R(K; A; \Pr)$ where $K$ is the possible worlds key, $A$ is
the value attribute, and $\Pr$ captures the probability of the corresponding tuple alternative.



3.2 Probabilistic And/Xor Tree 

We generalize the block-independent disjoint tuples model, which can capture mutual exclusion between tuples, by 
adding support for mutual co-existence, and allowing these to be specified in a hierarchical manner. Two events 
satisfy the mutual co-existence correlation if in any possible world, either both happen or neither occurs. We model 
such correlations using a probabilistic and/xor tree (or and/xor tree for short), which also generalizes the notions 
of x-tuples [34, 41], p-or-sets [12] and tuple independent databases. We first considered this model for tuple-level
uncertainty in an earlier paper [29], and generalize it here to handle attribute-level uncertainty.

We use $\vee$ (or) to denote mutual exclusion and $\wedge$ (and) for coexistence. Figure 1 shows two examples of
probabilistic and/xor trees. Briefly, the leaves of the tree correspond to the tuple alternatives (we abuse the notation somewhat
and use $t_i$ to denote both the tuple, and its key value). The first tree captures a relation with four independent tuples,
$t_1, t_2, t_3, t_4$, each with two alternatives, whereas the second tree shows how we can capture arbitrary possible worlds
using an and/xor tree (Figure 1(ii) shows the possible worlds corresponding to that tree).

Now, let us formally define a probabilistic and/xor tree. In tree $T$, we denote the set of children of node $v$ by
$Ch_T(v)$ and the least common ancestor of two leaves $l_1$ and $l_2$ by $LCA_T(l_1, l_2)$. We omit the subscript if the context
is clear.

Definition 1 A probabilistic and/xor tree $T$ represents the mutual exclusion and co-existence correlations in a
probabilistic relation $R^P(K; A)$, where $K$ is the possible worlds key, and $A$ is the value attribute. In $T$, each leaf is a
key-attribute pair (a tuple alternative), and each inner node has a mark, $\vee$ (or) or $\wedge$ (and). For each $\vee$ node $u$ and each of its
children $v \in Ch(u)$, there is a nonnegative value $\Pr(u, v)$ associated with the edge $(u, v)$. Moreover, we require

• (Probability Constraint) $\sum_{v \in Ch(u)} \Pr(u, v) \le 1$.

• (Key Constraint) For any two different leaves $l_1, l_2$ holding the same key, $LCA(l_1, l_2)$ is a $\vee$ node.$^2$

Let $T_v$ be the subtree rooted at $v$ and $Ch(v) = \{v_1, \ldots, v_\ell\}$. The subtree $T_v$ inductively defines a random subset $S_v$
of its leaves by the following independent process:

• If $v$ is a leaf, $S_v = \{v\}$.

• If $T_v$ roots at a $\vee$ (or) node, then
$$S_v = \begin{cases} S_{v_i} & \text{with prob } \Pr(v, v_i) \\ \emptyset & \text{with prob } 1 - \sum_{i=1}^{\ell} \Pr(v, v_i) \end{cases}$$

• If $T_v$ roots at a $\wedge$ (and) node, then $S_v = \bigcup_{i=1}^{\ell} S_{v_i}$.

2 The key constraint is imposed to avoid two leaves with the same key coexisting in a possible world.
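The generative process above can be sketched in a few lines of Python. This is only an illustration: the tuple-based node encoding, the function names `sample_world` and `bid_tree`, and all probabilities are assumptions made for the example, not notation from the paper.

```python
import random

# Hypothetical node encoding (an assumption for this sketch):
#   ("leaf", key, value)            a tuple alternative
#   ("or",  [(prob, child), ...])   at most one child is chosen
#   ("and", [child, ...])           all children are instantiated together

def sample_world(node, rng):
    """Draw the random leaf set S_v for the subtree rooted at `node`."""
    kind = node[0]
    if kind == "leaf":
        return {(node[1], node[2])}
    if kind == "and":
        world = set()
        for child in node[1]:
            world |= sample_world(child, rng)
        return world
    # "or" node: child i is picked with probability prob_i,
    # and the leftover probability mass yields the empty set.
    u = rng.random()
    acc = 0.0
    for prob, child in node[1]:
        acc += prob
        if u < acc:
            return sample_world(child, rng)
    return set()

def bid_tree():
    """Four independent tuples, each with two mutually exclusive alternatives."""
    blocks = []
    for key in ("t1", "t2", "t3", "t4"):
        blocks.append(("or", [(0.5, ("leaf", key, 1)), (0.3, ("leaf", key, 2))]))
    return ("and", blocks)

rng = random.Random(0)
worlds = [sample_world(bid_tree(), rng) for _ in range(1000)]
```

Because two alternatives of the same key only meet at an or node in this example, no sampled world contains two leaves with the same key, which is exactly what the key constraint is meant to guarantee.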

Probabilistic and/xor trees can capture more complicated correlations than the prior models such as the BID model
or x-tuples. We remark that Markov or Bayesian network models are able to capture more general correlations [35];
however, the structure of the model is more complex and probability computations on them (inference) are typically
exponential in the treewidth of the model. The treewidth of an and/xor tree (viewing it as a Markov network) is not
bounded, and hence the techniques developed for those models cannot be used to obtain polynomial time algorithms
for and/xor trees.



3.3 Computing Probabilities on And/Xor Trees 

Aside from the representational power of the and/xor tree model, perhaps its best feature is that many types of
probability computations can be done efficiently and elegantly on them using generating functions. In our prior work [29],
we used a similar technique for computing ranking functions for the tuple-level uncertainty model. Here we generalize
the idea to a broader range of probability computations.

We denote the and/xor tree by $T$. Suppose $X = \{x_1, x_2, \ldots\}$ is a set of variables. Define a mapping $s$ which
associates each leaf $l \in T$ with a variable $s(l) \in X$. Let $T_v$ denote the subtree rooted at $v$ and let $v_1, \ldots, v_\ell$ be $v$'s
children. For each node $v \in T$, we define a generating function $\mathcal{F}_v$ recursively:

• If $v$ is a leaf, $\mathcal{F}_v(X) = s(v)$.

• If $v$ is a $\vee$ (or) node,
$$\mathcal{F}_v(X) = \Big(1 - \sum_{h=1}^{\ell} \Pr(v, v_h)\Big) + \sum_{h=1}^{\ell} \mathcal{F}_{v_h}(X) \cdot \Pr(v, v_h).$$

• If $v$ is a $\wedge$ (and) node, $\mathcal{F}_v(X) = \prod_{h=1}^{\ell} \mathcal{F}_{v_h}(X)$.

The generating function $\mathcal{F}(X)$ for tree $T$ is the one defined above for the root. It is easy to see that, if we have a
constant number of variables, the polynomial can be expanded in the form $\sum_{i_1, i_2, \ldots} c_{i_1, i_2, \ldots} x_1^{i_1} x_2^{i_2} \cdots$ in polynomial
time.
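As a minimal sketch of this recursion, the Python code below expands the generating function for the special case where every leaf is mapped to one variable $x$, so that a polynomial is just a list of coefficients. The node encoding and the helper names (`poly_add`, `poly_mul`, `size_distribution`) are assumptions made for the illustration.

```python
# Hypothetical node encoding (an assumption for this sketch):
#   ("leaf", key, value), ("or", [(prob, child), ...]), ("and", [child, ...])

def poly_add(p, q):
    """Add two polynomials given as coefficient lists (index = power of x)."""
    n = max(len(p), len(q))
    return [(p[i] if i < len(p) else 0.0) + (q[i] if i < len(q) else 0.0)
            for i in range(n)]

def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def size_distribution(node):
    """Coefficients of the generating function when every leaf maps to x."""
    kind = node[0]
    if kind == "leaf":
        return [0.0, 1.0]                       # the polynomial x
    if kind == "and":
        result = [1.0]
        for child in node[1]:
            result = poly_mul(result, size_distribution(child))
        return result
    # "or" node: the leftover probability mass is the constant term
    result = [1.0 - sum(p for p, _ in node[1])]
    for prob, child in node[1]:
        result = poly_add(result, [prob * c for c in size_distribution(child)])
    return result

tree = ("and", [
    ("or", [(0.5, ("leaf", "t1", 1)), (0.3, ("leaf", "t1", 2))]),
    ("or", [(0.4, ("leaf", "t2", 7))]),
])
dist = size_distribution(tree)   # dist[i] is the coefficient of x^i
```

For this two-block tree the expansion is $(0.2 + 0.8x)(0.6 + 0.4x) = 0.12 + 0.56x + 0.32x^2$, so `dist[i]` gives the probability that a possible world has exactly $i$ tuples.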

Now recall that each possible world $pw$ contains a subset of the leaves of $T$ (as dictated by the $\vee$ (or) and $\wedge$ (and) nodes).
The following theorem characterizes the relationship between the coefficients of $\mathcal{F}$ and the probabilities we are
interested in.

Theorem 1 The coefficient of the term $\prod_j x_j^{i_j}$ in $\mathcal{F}(X)$ is the total probability of the possible worlds for which, for
all $j$, there are exactly $i_j$ leaves associated with variable $x_j$.

The proof is by induction on the tree structure and is omitted. 

Example 1 If we associate all leaves with the same variable $x$, the coefficient of $x^i$ is equal to $\Pr(|pw| = i)$.
The above can be used to obtain a distribution on the possible world sizes (Figure 1(i)).

Example 2 If we associate a subset $S$ of the leaves with variable $x$, and other leaves with constant 1, the coefficient
of $x^i$ is equal to $\Pr(|pw \cap S| = i)$.

Example 3 Next we show how to compute $\Pr(r(t) = i)$ (i.e., the probability $t$ is ranked at position $i$), where $r(t)$
denotes the rank of the tuple in a possible world by some score metric. Assume $t$ only has one alternative, $(t, a)$, and
hence only one possible value of score, $s$. Then, in the and/xor tree $T$, we associate all leaves with key other than $t$
and score value larger than $s$ with variable $x$, and the leaf $(t, a)$ with variable $y$, and the rest of the leaves with constant 1.
Then, the coefficient of $x^{i-1} y$ in the generating function is exactly $\Pr(r(t) = i)$. If the tuple has multiple alternatives,
we can compute $\Pr(r(t) = i)$ for it by summing up the probabilities for the alternatives.
See Figure 1(iii) for an example.
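The construction in Example 3 can be sketched with a two-variable polynomial stored as a dictionary from exponent pairs to coefficients. The node encoding, the function names (`pmul`, `gen`, `rank_prob`) and the sample data are assumptions made for this illustration, and the leaf value is taken to be the score.

```python
from collections import defaultdict

# Hypothetical node encoding (an assumption for this sketch):
#   ("leaf", key, score), ("or", [(prob, child), ...]), ("and", [child, ...])
# A polynomial in x and y is a dict {(power_of_x, power_of_y): coefficient}.

def pmul(p, q):
    out = defaultdict(float)
    for (a, b), c in p.items():
        for (d, e), f in q.items():
            out[(a + d, b + e)] += c * f
    return dict(out)

def gen(node, target):
    """Generating function for the variable assignment of Example 3."""
    kind = node[0]
    if kind == "leaf":
        key, score = node[1], node[2]
        if (key, score) == target:
            return {(0, 1): 1.0}              # the target alternative maps to y
        if key != target[0] and score > target[1]:
            return {(1, 0): 1.0}              # higher-scored other tuples map to x
        return {(0, 0): 1.0}                  # everything else maps to the constant 1
    if kind == "and":
        result = {(0, 0): 1.0}
        for child in node[1]:
            result = pmul(result, gen(child, target))
        return result
    out = defaultdict(float)
    out[(0, 0)] += 1.0 - sum(p for p, _ in node[1])
    for prob, child in node[1]:
        for mono, c in gen(child, target).items():
            out[mono] += prob * c
    return dict(out)

def rank_prob(tree, target, i):
    """Coefficient of x^(i-1) y: probability that `target` is ranked at position i."""
    return gen(tree, target).get((i - 1, 1), 0.0)

# Three independent tuples with scores 10, 8, 5 and presence probabilities 0.5, 0.4, 0.9
tree = ("and", [
    ("or", [(0.5, ("leaf", "t1", 10))]),
    ("or", [(0.4, ("leaf", "t2", 8))]),
    ("or", [(0.9, ("leaf", "t3", 5))]),
])
p1 = rank_prob(tree, ("t3", 5), 1)
p2 = rank_prob(tree, ("t3", 5), 2)
p3 = rank_prob(tree, ("t3", 5), 3)
```

In this small instance `p1` equals $0.9 \cdot 0.5 \cdot 0.6 = 0.27$, since $t_3$ is ranked first exactly when it is present and both higher-scored tuples are absent.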

3.4 Problem Definition 

We denote the domain of answers for a query by $\Omega$ and the distance function between two answers by $d()$. Formally,
we define the most consensus answer $\tau$ to be a feasible answer to the query such that the expected distance between $\tau$
and the answer $\tau_{pw}$ of the (random) world $pw$ is minimized, i.e., $\tau = \arg\min_{\tau' \in \Omega} \{E[d(\tau', \tau_{pw})]\}$.

We call the most consensus answer in $\Omega$ the mean answer when $\Omega$ is the set of all feasible answers. If $\Omega$ is
restricted to be the set of possible answers (answers of some possible worlds with non-zero probability), we call the
most consensus answer in $\Omega$ the median answer. Taking the example of the Top-k queries, the median answer must be
the Top-k answer of some possible world while the mean answer can be any sorted list of size $k$.
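To illustrate the difference between the two notions, the brute-force sketch below evaluates the definition directly on a tiny, made-up distribution over set-valued query answers, using the size of the symmetric difference as the distance. The distribution, the names `expected_distance` and `consensus`, and the choice of distance are assumptions for the example; the enumeration is exponential and is not one of the algorithms of this paper.

```python
from itertools import combinations

def expected_distance(answer, dist, d):
    """E[d(answer, answer_pw)] for an explicit distribution over possible answers."""
    return sum(p * d(answer, ans) for ans, p in dist.items())

def consensus(candidates, dist, d):
    """Candidate with the smallest expected distance to the possible answers."""
    return min(candidates, key=lambda a: expected_distance(a, dist, d))

def sym_diff(a, b):
    return len(a ^ b)

# Hypothetical possible answers with their probabilities
dist = {
    frozenset({"a", "b"}): 0.4,
    frozenset({"b", "c"}): 0.3,
    frozenset({"a", "c"}): 0.3,
}
universe = ["a", "b", "c"]
all_answers = [frozenset(c) for r in range(len(universe) + 1)
               for c in combinations(universe, r)]

mean = consensus(all_answers, dist, sym_diff)     # any feasible answer allowed
median = consensus(list(dist), dist, sym_diff)    # restricted to possible answers
```

Here the mean answer is $\{a, b, c\}$ with expected distance 1.0, which is not a possible answer, while the median answer is $\{a, b\}$ with expected distance 1.2.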



4 Set Distance Measures 

We first consider the problem of finding the consensus world for a given probabilistic database, under two set distance 
measures: symmetric difference, and Jaccard distance. 

4.1 Symmetric Difference 

The symmetric difference distance between two sets $S_1, S_2$ is defined to be $d_\Delta(S_1, S_2) = |S_1 \Delta S_2| = |(S_1 \setminus S_2) \cup
(S_2 \setminus S_1)|$. Note that two different alternatives of a tuple are treated as different tuples here.



Theorem 2 The mean world under the symmetric difference distance is the set of all tuples with probability > 0.5.

Proof: Suppose $S$ is a fixed set of tuples and $\bar{S} = T - S$. Let
$$\delta(p) = \begin{cases} 1, & \text{if } p = \text{true} \\ 0, & \text{if } p = \text{false} \end{cases}$$
be the indicator function. We write $E_{pw \in PW}[d_\Delta(S, pw)]$ as follows:

$$\begin{aligned}
E[d_\Delta(S, pw)] &= E\Big[\sum_{t \in S} \delta(t \notin pw) + \sum_{t \in \bar{S}} \delta(t \in pw)\Big] \\
&= \sum_{t \in S} E[\delta(t \notin pw)] + \sum_{t \in \bar{S}} E[\delta(t \in pw)] = \sum_{t \in S} \Pr(\neg t) + \sum_{t \in \bar{S}} \Pr(t)
\end{aligned}$$

Thus, each tuple $t$ contributes $\Pr(\neg t)$ to the expected distance if $t \in S$ and $\Pr(t)$ otherwise, and hence the minimum
is achieved by the set of tuples with probability greater than 0.5. □
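The closed form in the proof depends only on the marginal probabilities $\Pr(t)$, so it is easy to check on a small instance. The following sketch uses made-up marginals and hypothetical function names, and compares the thresholding rule with an exhaustive search over all subsets.

```python
from itertools import combinations

def expected_sym_diff(S, prob):
    """Sum of Pr(not t) over t in S plus Pr(t) over t outside S, as in the proof."""
    return sum((1 - p) if t in S else p for t, p in prob.items())

def mean_world(prob):
    """Tuples whose marginal probability exceeds 0.5."""
    return {t for t, p in prob.items() if p > 0.5}

# Hypothetical marginal probabilities
prob = {"t1": 0.9, "t2": 0.6, "t3": 0.3, "t4": 0.1}

best = mean_world(prob)
tuples = list(prob)
brute = min(
    (set(c) for r in range(len(tuples) + 1) for c in combinations(tuples, r)),
    key=lambda S: expected_sym_diff(S, prob),
)
best_value = expected_sym_diff(best, prob)
```

For these marginals the thresholding rule returns $\{t_1, t_2\}$ with expected distance $0.1 + 0.4 + 0.3 + 0.1 = 0.9$, and the exhaustive search finds the same set.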

Finding the consensus median world is somewhat trickier, with the main concern being that the world that contains 
all tuples with probability > 0.5 may not be a possible world. 

Corollary 1 If the correlation can be modeled by a probabilistic and/xor tree, the median world is the set containing all
tuples with probability greater than 0.5. 

The proof is by induction on the height of the tree, and is omitted for space constraints. This however does not
hold for arbitrary correlations, and it is easy to see that finding a median world is NP-Hard even if result tuple
probability computation is easy. We show a reduction from MAX-2-SAT for a simple 2-relation query. Let the
MAX-2-SAT instance consist of $n$ literals, $x_1, \ldots, x_n$, and $k$ clauses. Consider a query $R \bowtie S$, where $S(x, b) =
\{(x_1, 0), (x_1, 1), (x_2, 0), (x_2, 1), \ldots\}$ contains two mutually exclusive tuples for each of the $n$ literals; all tuples are
equiprobable with probability 0.5. $R(C, x, b)$ is a certain table, and contains two tuples for each clause: for the clause
$c_1 = x_1 \vee \bar{x}_2$, it contains tuples $(c_1, x_1, 1)$ and $(c_1, x_2, 0)$. The result of $\pi_C(R \bowtie S)$ contains one tuple for each
clause, associated with a probability of 0.75. So the median answer is the possible answer containing the maximum
number of tuples, which corresponds to finding the assignment to the $x_i$'s that maximizes the number of satisfied clauses.
4.2 Jaccard Distance

The Jaccard distance between two sets $S_1, S_2$ is defined to be $d_J(S_1, S_2) = \frac{|S_1 \Delta S_2|}{|S_1 \cup S_2|}$. Jaccard distance always lies in
[0, 1] and is a real metric, i.e., it satisfies the triangle inequality. Next we present polynomial time algorithms for finding the
mean and median worlds for tuple independent databases, and the median world for the BID model.

Lemma 1 Given an and/xor tree $T$ and a possible world $W$ for it (corresponding to a set of leaves of $T$), we can
compute $E[d_J(W, pw)]$ in polynomial time.

Proof: A generating function $\mathcal{F}$ is constructed with the variables associated with leaves as follows: for $t \in W$
($t \notin W$), the associated variable is $x$ ($y$). For example, in a tuple independent database, the generating function is:

$$\mathcal{F}(x, y) = \prod_{t \in W} \big(\Pr(\neg t) + \Pr(t) x\big) \prod_{t \notin W} \big(\Pr(\neg t) + \Pr(t) y\big)$$

From Theorem 1, the coefficient $c_{i,j}$ of term $x^i y^j$ in generating function $\mathcal{F}$ is equal to the total probability of
the worlds such that the Jaccard distance between those worlds and $W$ is exactly $\frac{|W| - i + j}{|W| + j}$. Thus, the distance is
$$\sum_{i,j} c_{i,j} \frac{|W| - i + j}{|W| + j}.$$
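For the tuple-independent case, the computation in this proof can be sketched as follows. The code builds the coefficients $c_{i,j}$ by multiplying in one factor per tuple and cross-checks the result against direct enumeration of all worlds. The function names and probabilities are assumptions for the illustration, and the distance between two empty sets is taken to be 0 by convention (an assumption not stated in the text).

```python
from itertools import product

def expected_jaccard_gf(W, prob):
    """E[d_J(W, pw)] from the coefficients c_{i,j} of the two-variable generating function."""
    coeff = {(0, 0): 1.0}                      # c_{i,j}: i tuples inside W, j outside W
    for t, p in prob.items():
        di, dj = (1, 0) if t in W else (0, 1)  # variable x for t in W, y otherwise
        nxt = {}
        for (i, j), c in coeff.items():
            nxt[(i, j)] = nxt.get((i, j), 0.0) + c * (1 - p)
            nxt[(i + di, j + dj)] = nxt.get((i + di, j + dj), 0.0) + c * p
        coeff = nxt
    k = len(W)
    # Terms with |W| + j = 0 compare two empty sets; their distance is taken as 0 here.
    return sum(c * (k - i + j) / (k + j) for (i, j), c in coeff.items() if k + j > 0)

def expected_jaccard_brute(W, prob):
    """Same expectation by enumerating all 2^n worlds (for checking only)."""
    tuples = list(prob)
    total = 0.0
    for bits in product([0, 1], repeat=len(tuples)):
        world = {t for t, b in zip(tuples, bits) if b}
        pr = 1.0
        for t, b in zip(tuples, bits):
            pr *= prob[t] if b else 1 - prob[t]
        union = W | world
        if union:
            total += pr * len(W ^ world) / len(union)
    return total

prob = {"t1": 0.9, "t2": 0.6, "t3": 0.3, "t4": 0.1}   # hypothetical probabilities
W = {"t1", "t2"}
via_gf = expected_jaccard_gf(W, prob)
via_enum = expected_jaccard_brute(W, prob)
```

The polynomial has at most $(|W| + 1)(|T| - |W| + 1)$ coefficients, which is what keeps the generating-function route polynomial while the enumeration is exponential.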

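Lemma 2 For tuple independent databases, if the mean world contains tuple $t_1$ but not tuple $t_2$, then $\Pr(t_1) \ge \Pr(t_2)$.

Proof: Say $W_1$ is the mean world and the lemma is not true, i.e., $\exists t_1 \in W_1, t_2 \notin W_1$ s.t. $\Pr(t_1) < \Pr(t_2)$. Let
$W = W_1 - \{t_1\}$, $W_2 = W + \{t_2\}$ and $W' = T - W - \{t_1\} - \{t_2\}$. We will prove $W_2$ has a smaller expected Jaccard
distance, thus rendering a contradiction. Suppose $|W_1| = |W_2| = k$. We let matrix $M = [m_{i,j}]_{i,j}$ where $m_{i,j} = \frac{k - i + j}{k + j}$.
We construct generating functions as we did in Lemma 1. Suppose $\mathcal{F}_1$ and $\mathcal{F}_2$ are the generating functions for $W_1$
and $W_2$, respectively. We write $\|A\| = \sum_{i,j} a_{i,j}$ for any matrix $A$ and let $A \otimes B$ be the Hadamard product of $A$ and $B$
(take product entrywise). We denote:

$$\mathcal{F}'(x, y) = \prod_{t \in W} \big(\Pr(\neg t) + \Pr(t) x\big) \prod_{t \in W'} \big(\Pr(\neg t) + \Pr(t) y\big)$$

We can easily see:

$$\begin{aligned}
\mathcal{F}_1(x, y) &= \mathcal{F}'(x, y) \big(\Pr(\neg t_1) + \Pr(t_1) x\big) \big(\Pr(\neg t_2) + \Pr(t_2) y\big) \\
\mathcal{F}_2(x, y) &= \mathcal{F}'(x, y) \big(\Pr(\neg t_1) + \Pr(t_1) y\big) \big(\Pr(\neg t_2) + \Pr(t_2) x\big)
\end{aligned}$$

Then, taking the difference, we get that $\mathcal{F} = \mathcal{F}_1(x, y) - \mathcal{F}_2(x, y)$ is equal to:

$$\mathcal{F}'(x, y) \big(\Pr(\neg t_1)\Pr(t_2) - \Pr(t_1)\Pr(\neg t_2)\big) (y - x) \qquad (1)$$

Let $C_{\mathcal{F}} = [c_{i,j}]$ be the coefficient matrix of $\mathcal{F}$ where $c_{i,j}$ is the coefficient of term $x^i y^j$. Using the proof of Lemma 1:

$$E[d(W_1, pw)] - E[d(W_2, pw)] = \|C_{\mathcal{F}_1} \otimes M\| - \|C_{\mathcal{F}_2} \otimes M\| = \|C_{\mathcal{F}} \otimes M\|$$

Let $c'_{i,j}$ and $c_{i,j}$ be the coefficients of $x^i y^j$ in $\mathcal{F}'$ and $\mathcal{F}$, respectively. It is not hard to see $c_{i,j} = (c'_{i,j-1} - c'_{i-1,j})\, p$ from
(1), where $p = \big(\Pr(\neg t_1)\Pr(t_2) - \Pr(t_1)\Pr(\neg t_2)\big) > 0$. Then we have

$$\begin{aligned}
\|C_{\mathcal{F}} \otimes M\| &= p \sum_{i,j} (c'_{i,j-1} - c'_{i-1,j})\, m_{i,j} \\
&= p \sum_{i,j} c'_{i,j} \big(m_{i,j+1} - m_{i+1,j}\big) \\
&= p \sum_{i,j} c'_{i,j} \Big(\frac{k - i + j + 1}{k + j + 1} - \frac{k - i - 1 + j}{k + j}\Big)
\end{aligned}$$

Due to the fact that $\frac{k - i + j + 1}{k + j + 1} - \frac{k - i - 1 + j}{k + j} > 0$ for any $i, j \ge 0$, the proof is complete. □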
The above two lemmas can be used to efficiently find the mean world for tuple-independent databases, by sorting
the tuples in decreasing order of probability, and computing the expected distance for every prefix of the sorted
order.

A similar algorithm can be used to find the median world for the BID model (by only considering the highest
probability alternative for each tuple). Finding mean worlds or median worlds under more general correlation models
remains an open problem.
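A minimal sketch of this prefix search for a tuple-independent database is given below. The expected distance of each prefix is computed from the coefficients $c_{i,j}$ as in Lemma 1, and the result is compared against an exhaustive search over all subsets. Function names and probabilities are made up for the example, and the distance between two empty sets is taken to be 0 by convention (an assumption).

```python
from itertools import combinations

def expected_jaccard(W, prob):
    """E[d_J(W, pw)] for a tuple-independent database, via the coefficients c_{i,j}."""
    coeff = {(0, 0): 1.0}
    for t, p in prob.items():
        di, dj = (1, 0) if t in W else (0, 1)
        nxt = {}
        for (i, j), c in coeff.items():
            nxt[(i, j)] = nxt.get((i, j), 0.0) + c * (1 - p)
            nxt[(i + di, j + dj)] = nxt.get((i + di, j + dj), 0.0) + c * p
        coeff = nxt
    k = len(W)
    return sum(c * (k - i + j) / (k + j) for (i, j), c in coeff.items() if k + j > 0)

def mean_world_jaccard(prob):
    """Best prefix of the tuples sorted by decreasing probability (Lemma 2)."""
    order = sorted(prob, key=lambda t: -prob[t])
    prefixes = [set(order[:n]) for n in range(len(order) + 1)]
    return min(prefixes, key=lambda W: expected_jaccard(W, prob))

# Hypothetical probabilities
prob = {"t1": 0.9, "t2": 0.7, "t3": 0.45, "t4": 0.2, "t5": 0.05}

best = mean_world_jaccard(prob)
all_sets = [set(c) for r in range(len(prob) + 1) for c in combinations(prob, r)]
brute = min(all_sets, key=lambda W: expected_jaccard(W, prob))
```

Only $n + 1$ prefixes are evaluated instead of $2^n$ subsets; the exhaustive search is included purely as a sanity check on small inputs.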

5 Top-k Queries 

In this section, we consider Top-k queries in probabilistic databases. Each tuple $t_i$ has a score $s(t_i)$. In the tuple-level
uncertainty model, $s(t_i)$ is fixed for each $t_i$, while in the attribute-level uncertainty model, it is a random variable.
In the and/xor tree model, we assume that the attribute field is the score (uncertain attributes that don't contribute to
the score can be ignored). We further assume no two tuples can take the same score, to avoid ties. We use $r(t)$ to
denote the random variable indicating the rank of $t$ and $r_{pw}(t)$ to denote the rank of $t$ in possible world $pw$. If $t$ does
not appear in the possible world $pw$, then $r_{pw}(t) = \infty$. So, $\Pr(r(t) > i)$ includes the probability that $t$'s rank is larger
than $i$ and that $t$ doesn't exist. We say $t_1$ ranks higher than $t_2$ in possible world $pw$ if $r_{pw}(t_1) < r_{pw}(t_2)$.

Finally, we use the symbol $\tau$ to denote rankings, and $\tau^i$ to denote the restriction of the Top-k list $\tau$ to the first $i$
items. We use $\tau(i)$ to denote the $i$-th item in the list $\tau$ for positive integer $i$, and $\tau(t)$ to denote the position of $t \in T$ in
$\tau$.

5.1 Distance between Two Top-k Answers 

Fagin et al. [16] provide a comprehensive analysis of the problem of comparing two Top-k lists. They present
extensions of the Kendall's tau and Spearman footrule metrics (defined on full rankings) to Top-k lists and propose several
other natural metrics, such as the intersection metric and Goodman and Kruskal's gamma function. In our paper, we
consider three of the metrics discussed in that paper: the symmetric difference metric, the intersection metric and one
particular extension to Spearman's footrule distance. We briefly recall some definitions here. For more details and the
relation between different definitions, please refer to [16].

Given two Top-k lists, $\tau_1$ and $\tau_2$, the normalized symmetric difference metric is defined as:

$$d_\Delta(\tau_1, \tau_2) = \frac{1}{2k} |\tau_1 \Delta \tau_2| = \frac{1}{2k} |(\tau_1 \setminus \tau_2) \cup (\tau_2 \setminus \tau_1)|.$$

While $d_\Delta$ focuses only on the membership, the intersection metric $d_I$ also takes the order of tuples into
consideration. It is defined to be:

$$d_I(\tau_1, \tau_2) = \frac{1}{k} \sum_{i=1}^{k} d_\Delta(\tau_1^i, \tau_2^i)$$

Both $d_\Delta$ and $d_I$ values are always between 0 and 1.
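These two definitions translate directly into code. The sketch below represents a Top-k list as a Python list ordered from rank 1 to rank $k$; the function names and the sample lists are assumptions for the illustration.

```python
def d_delta(t1, t2):
    """Normalized symmetric difference between two lists of the same length k."""
    k = len(t1)
    return len(set(t1) ^ set(t2)) / (2 * k)

def d_intersection(t1, t2):
    """Intersection metric: average of d_delta over the prefixes of length 1..k."""
    k = len(t1)
    return sum(d_delta(t1[:i], t2[:i]) for i in range(1, k + 1)) / k

# Hypothetical Top-3 lists
tau1 = ["a", "b", "c"]
tau2 = ["b", "a", "d"]

dd = d_delta(tau1, tau2)          # only c and d differ: 2 / (2 * 3)
di = d_intersection(tau1, tau2)   # prefix distances 1, 0, 1/3 averaged over k = 3
```

The two lists share their first two members but in swapped order, which $d_\Delta$ ignores and $d_I$ penalizes through the length-1 prefix.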

The original Spearman's Footrule metric is defined as the $L_1$ distance between two permutations $\sigma_1$ and $\sigma_2$.
Formally, $F(\sigma_1, \sigma_2) = \sum_{t \in T} |\sigma_1(t) - \sigma_2(t)|$. Let $\ell$ be an integer greater than $k$. The footrule distance with location
parameter $\ell$, denoted $F^{(\ell)}$, generalizes the original footrule metric. It is obtained by placing all missing elements in
each list at position $\ell$ and then computing the usual footrule distance between them. A natural choice of $\ell$ is $k + 1$ and
we denote $F^{(k+1)}$ by $d_F$. It is also proven that $d_F$ is a real metric and a member of a big and important equivalence
class.$^3$
It is shown in [16] that:

$$d_F(\tau_1, \tau_2) = (k + 1) |\tau_1 \Delta \tau_2| + \sum_{t \in \tau_1 \cap \tau_2} |\tau_1(t) - \tau_2(t)| - \sum_{t \in \tau_1 \setminus \tau_2} \tau_1(t) - \sum_{t \in \tau_2 \setminus \tau_1} \tau_2(t)$$
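As a quick check of this identity, the sketch below computes the footrule distance with location parameter $\ell = k + 1$ in two ways: directly from the definition (missing elements placed at position $\ell$) and from the closed form. The function names and sample lists are assumptions for the illustration.

```python
def footrule_located(t1, t2, ell):
    """Footrule distance after placing every missing element at position `ell`."""
    pos1 = {t: i + 1 for i, t in enumerate(t1)}
    pos2 = {t: i + 1 for i, t in enumerate(t2)}
    return sum(abs(pos1.get(t, ell) - pos2.get(t, ell)) for t in set(t1) | set(t2))

def footrule_closed_form(t1, t2):
    """Closed form for ell = k + 1."""
    k = len(t1)
    pos1 = {t: i + 1 for i, t in enumerate(t1)}
    pos2 = {t: i + 1 for i, t in enumerate(t2)}
    s1, s2 = set(t1), set(t2)
    return ((k + 1) * len(s1 ^ s2)
            + sum(abs(pos1[t] - pos2[t]) for t in s1 & s2)
            - sum(pos1[t] for t in s1 - s2)
            - sum(pos2[t] for t in s2 - s1))

# Hypothetical Top-3 lists
tau1 = ["a", "b", "c"]
tau2 = ["b", "a", "d"]

direct = footrule_located(tau1, tau2, len(tau1) + 1)
closed = footrule_closed_form(tau1, tau2)
```

Both computations give 4 on this pair: each of the four elements $a, b, c, d$ is displaced by exactly one position once $c$ and $d$ are placed at position 4 in the list that lacks them.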

Next we consider the problem of evaluating consensus answers for these distance metrics. 

3 All distance functions in one equivalence class are bounded by each other within a constant factor. This class includes several extensions of 
Spearman's footrule and Kendall's tau metrics. 
5.2 Symmetric Difference and PT-k function

In this section, we show how to find mean and median Top-k answers under the symmetric difference metric in the and/xor
tree model. The probabilistic threshold Top-k (PT-k) query [22] has been proposed for evaluating ranking queries
over probabilistic databases, and essentially returns all tuples $t$ for which $\Pr(r(t) \le k)$ is greater than a given threshold.
If we set the threshold carefully so that the PT-k query returns $k$ tuples, we can show that the answer returned is the
mean answer under the symmetric difference metric.

Theorem 3 If $\tau = \{\tau(1), \tau(2), \ldots, \tau(k)\}$ is the set of $k$ tuples with the largest $\Pr(r(t) \le k)$, then $\tau$ is the mean
Top-k answer under metric $d_\Delta$, i.e., the answer minimizes $E[d_\Delta(\tau, \tau_{pw})]$.

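Proof: Suppose $\tau$ is fixed. Dropping the constant normalization factor of $d_\Delta$, which does not affect the minimizer,
we write $E[d_\Delta(\tau, \tau_{pw})]$ as follows:

$$\begin{aligned}
E[d_\Delta(\tau, \tau_{pw})] &= E\Big[\sum_{t \in T} \delta(t \in \tau \wedge t \notin \tau_{pw}) + \delta(t \in \tau_{pw} \wedge t \notin \tau)\Big] \\
&= \sum_{t \in T \setminus \tau} \Pr(r(t) \le k) + \sum_{t \in \tau} \Pr(r(t) > k) \\
&= k + \sum_{t \in T} \Pr(r(t) \le k) - 2 \sum_{t \in \tau} \Pr(r(t) \le k)
\end{aligned}$$

The first two terms are invariant with respect to $\tau$. Therefore, it is clear that the set of $k$ tuples with the largest
$\Pr(r(t) \le k)$ minimizes the expectation. □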
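The sketch below illustrates Theorem 3 on a small tuple-independent database with fixed scores, a special case of the and/xor tree model. It computes $\Pr(r(t) \le k)$ with a simple pass over the tuples in score order, picks the $k$ largest values, and checks the choice against brute-force enumeration of all worlds using the unnormalized symmetric difference. All names and numbers are assumptions for the example.

```python
from itertools import combinations, product

def topk_membership_probs(scores, prob, k):
    """Pr(r(t) <= k) for independent tuples with distinct, fixed scores."""
    member = {}
    ahead = [1.0]     # ahead[c]: probability that exactly c higher-scored tuples are present
    for t in sorted(prob, key=lambda u: -scores[u]):
        member[t] = prob[t] * sum(ahead[:k])
        nxt = [0.0] * (len(ahead) + 1)
        for c, pr in enumerate(ahead):
            nxt[c] += pr * (1 - prob[t])
            nxt[c + 1] += pr * prob[t]
        ahead = nxt
    return member

def mean_topk(scores, prob, k):
    """The k tuples with the largest Pr(r(t) <= k)."""
    member = topk_membership_probs(scores, prob, k)
    return set(sorted(member, key=lambda t: -member[t])[:k])

def expected_sym_diff(answer, scores, prob, k):
    """Brute-force E[|answer symmetric-difference tau_pw|] over all worlds (checking only)."""
    tuples = list(prob)
    total = 0.0
    for bits in product([0, 1], repeat=len(tuples)):
        pr = 1.0
        for t, b in zip(tuples, bits):
            pr *= prob[t] if b else 1 - prob[t]
        present = [t for t, b in zip(tuples, bits) if b]
        top = set(sorted(present, key=lambda t: -scores[t])[:k])
        total += pr * len(answer ^ top)
    return total

# Hypothetical scores and presence probabilities
scores = {"t1": 10, "t2": 8, "t3": 5, "t4": 3}
prob = {"t1": 0.3, "t2": 0.9, "t3": 0.6, "t4": 0.8}
k = 2

answer = mean_topk(scores, prob, k)
best_value = min(expected_sym_diff(set(c), scores, prob, k)
                 for c in combinations(prob, k))
```

In this instance the highest-scored tuple $t_1$ is left out of the mean answer because it is rarely present, while $t_3$ is included even though its score is lower.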

To find a median answer, we essentially need to find the Top-k answer $\tau$ of some possible world such that
$\sum_{t \in \tau} \Pr(r(t) \le k)$ is maximum. Next we show how to do this given an and/xor tree in polynomial time.

We write P(t) = Pr(r(t) < k) for ease of notation. We use dynamic programming over the tree structure. For 
each possible attribute value a € A, let T a be the tree which contains all leaves with attribute value at least a. We 
recursively compute the set of tuples pw% , t , which maximizes the value J2 tepw a P(t) among all possible worlds 
generated by the subtree T£ rooted at v and of size i, for each node v in T a and 1 < i < k. We compute this for all 
different a values, and the optimal solution can be chosen to be min a (pu;" k ). 

Suppose vx, V2, ■ ■ ■ ,vi are w's children. The recursion formula is: 

• If v is a © node, pwf tl = argmax^gp^r^) J2te P w p ( f )- 

• If v is a ® node, pw® i — Ujpwj such that J2j \p w j I = i,pwj £ PW{T£. ) and X^teu pw ^ W is maximized. 

In the latter case, the maximum value can be computed by dynamic programming again as follows. Let pw^ Vi , i = 
Uj =1 pwj such that X^-=i \P w j I = hPWj G PW(T^ ) and J2 tGU h pw P(t) is maximized. It can be computed recur- 
sive by seeing pw^ Vh] i = pwf Vi ,...„ h _ l]iP U pw;,, for P; 9 such that p + q = % and Etepi»f„ ^ w \^ P ® 
is maximized. Then, it is easy to see pw a (v, i) is simply pw a ([v\, . . . , vi],i). 

Theorem 4 The median Top-k answer under symmetric difference metric can be found in polynomial time for a 
probabilistic and/xor tree. 
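The recursion behind Theorem 4 can be sketched in a few lines. This is a minimal illustration for one fixed tree $T^a$, under simplifying assumptions that are not from the paper: a xor node chooses exactly one child (or none when `allow_empty` is set), an and node takes the union of its children's worlds, and only the optimal value $\sum_t P(t)$ is tracked rather than the set itself. The `Node` class and `best_by_size` are hypothetical names.

```python
from dataclasses import dataclass, field
from typing import List

NEG_INF = float("-inf")

@dataclass
class Node:
    kind: str                       # "leaf", "xor" or "and"
    weight: float = 0.0             # P(t), used by leaves only
    children: List["Node"] = field(default_factory=list)
    allow_empty: bool = False       # xor only: the node may choose no child

def best_by_size(v: Node, k: int) -> List[float]:
    """best[i] = max total P(t) over worlds of size i under v (-inf if none)."""
    if v.kind == "leaf":
        best = [NEG_INF] * (k + 1)
        if k >= 1:
            best[1] = v.weight
        return best
    if v.kind == "xor":
        # A xor node keeps, for every size, the best world of any one child.
        best = [NEG_INF] * (k + 1)
        if v.allow_empty:
            best[0] = 0.0
        for child in v.children:
            best = [max(a, b) for a, b in zip(best, best_by_size(child, k))]
        return best
    # and node: fold the children in one by one, splitting size i as p + q.
    acc = [0.0] + [NEG_INF] * k
    for child in v.children:
        cb = best_by_size(child, k)
        new = [NEG_INF] * (k + 1)
        for p in range(k + 1):
            if acc[p] == NEG_INF:
                continue
            for q in range(k + 1 - p):
                if cb[q] != NEG_INF:
                    new[p + q] = max(new[p + q], acc[p] + cb[q])
        acc = new
    return acc

def leaf(w: float) -> Node:
    return Node("leaf", weight=w)

# Three x-tuples under an and root; the second one may be absent.
root = Node("and", children=[
    Node("xor", children=[leaf(0.9), leaf(0.2)]),
    Node("xor", children=[leaf(0.5)], allow_empty=True),
    Node("xor", children=[leaf(0.4), leaf(0.7)]),
])
best = best_by_size(root, 2)
```

Recovering the actual tuple set would need back-pointers alongside each table entry; the value table is enough to see that the work per node is polynomial in $k$ and the number of children.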

5.3 Intersection Metric 

Note that the intersection metric $d_I$ is a linear combination of the normalized symmetric difference metric $d_\Delta$. Using
an approach similar to the proof of Theorem 3, we can show that:

$$\begin{aligned}
E[d_I(\tau,\tau_{pw})] &= \frac{1}{k}\sum_{i=1}^{k} E[d_\Delta(\tau^i,\tau^i_{pw})] \\
&= \frac{1}{k}\sum_{i=1}^{k} \frac{1}{2i}\Big(i + \sum_{t\in T} \Pr(r(t)\le i) - 2\sum_{t\in\tau^i} \Pr(r(t)\le i)\Big)
\end{aligned}$$
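Thus we need to find $\tau$ which maximizes the last term, $A(\tau) = \sum_{i=1}^{k} \big(\frac{1}{i}\sum_{t\in\tau^i} \Pr(r(t) \le i)\big)$. We first rewrite
the objective as follows, using the indicator ($\delta$) function:

$$\begin{aligned}
A(\tau) &= \sum_{i=1}^{k}\Big(\frac{1}{i}\sum_{t\in T} \Pr(r(t)\le i)\,\delta(t\in\tau^i)\Big) \\
&= \sum_{t\in T}\sum_{i=1}^{k}\Big(\frac{1}{i}\Pr(r(t)\le i)\sum_{j=1}^{i}\delta(t=\tau(j))\Big) \\
&= \sum_{t\in T}\sum_{j=1}^{k}\Big(\delta(t=\tau(j))\sum_{i=j}^{k}\frac{1}{i}\Pr(r(t)\le i)\Big)
\end{aligned}$$

The last equality holds since $\sum_{i=1}^{k}\sum_{j=1}^{i} a_{ij} = \sum_{j=1}^{k}\sum_{i=j}^{k} a_{ij}$.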

The optimization task can thus be written as an assignment problem, with each tuple $t$ acting as an agent and each
of the Top-k positions $j$ as a task. Assigning task $j$ to agent $t$ gains a profit of $\sum_{i=j}^{k} \frac{1}{i}\Pr(r(t) \le i)$ and the goal is to
find an assignment such that each task is assigned to at most one agent, and the profit is maximized. The best known
algorithm for computing the optimal assignment runs in $O(nk\sqrt{n})$ time, via computing a maximum weight matching
on a bipartite graph [30].
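To make the reduction concrete, the sketch below builds the profit of placing tuple $t$ at position $j$ and confirms that the total profit of an assignment equals $A(\tau)$ computed from its definition. It is only an illustration: the values in `cum` are hypothetical, and an exhaustive search over ordered $k$-lists stands in for the maximum weight matching algorithm, so it is usable only on tiny inputs.

```python
from itertools import permutations

def profit(cum, t, j, k):
    """Profit of placing tuple t at position j: sum_{i=j}^{k} Pr(r(t) <= i) / i.

    cum[t][i - 1] holds Pr(r(t) <= i).
    """
    return sum(cum[t][i - 1] / i for i in range(j, k + 1))

def A_direct(cum, tau, k):
    """A(tau) from its definition; tau^i is the list of the first i entries."""
    return sum(sum(cum[t][i - 1] for t in tau[:i]) / i for i in range(1, k + 1))

def total_profit(cum, tau, k):
    """Sum of assignment profits for the ordered list tau."""
    return sum(profit(cum, t, j, k) for j, t in enumerate(tau, 1))

def best_assignment(cum, k):
    """Exhaustive search (a stand-in for a bipartite matching solver)."""
    return max(permutations(cum, k), key=lambda tau: total_profit(cum, tau, k))

cum = {  # hypothetical values of Pr(r(t) <= i) for i = 1, 2, 3
    "a": [0.30, 0.30, 0.30],
    "b": [0.63, 0.90, 0.90],
    "c": [0.04, 0.40, 0.60],
    "d": [0.02, 0.20, 0.50],
}
k = 3
tau = best_assignment(cum, k)
```

For realistic sizes the same profit matrix would be handed to a polynomial-time assignment solver instead of `permutations`.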

Approximating the Intersection Metric: We define the following ranking function, where $H_k$ denotes the $k$-th Harmonic
number:

$$\Upsilon_H(t) = \sum_{i=1}^{k} (H_k - H_{i-1}) \Pr(r(t) = i) = \sum_{i=1}^{k} \frac{1}{i}\Pr(r(t) \le i).$$

This is a special case of the parameterized ranking function proposed in [29] and can be computed in $O(nk\log^2 n)$
time for all tuples in the and/xor tree. We claim that the Top-k answer $\tau_H$ returned by the $\Upsilon_H$ function, i.e., the $k$ tuples
with the highest $\Upsilon_H$ values, is a good approximation of the mean answer with respect to the intersection metric by
arguing that $\tau_H = \{t_1, t_2, \ldots, t_k\}$ is actually an approximate maximizer of $A(\tau)$. Indeed, we prove the fact that
$A(\tau_H) \ge \frac{1}{H_k} A(\tau^*)$ where $\tau^*$ is the optimal mean Top-k answer.
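The ranking function and the claimed bound can be tried out on a small example. The sketch below assumes the rank distributions $\Pr(r(t) = i)$ are already available (the numbers in `pr` are hypothetical), checks that the two forms of $\Upsilon_H$ agree, and compares $A(\tau_H)$ with the exhaustive optimum $A(\tau^*)$.

```python
from itertools import permutations

def harmonic(n):
    """n-th Harmonic number, with H_0 = 0."""
    return sum(1.0 / i for i in range(1, n + 1))

def upsilon_H(pr_t, k):
    """sum_i (H_k - H_{i-1}) * Pr(r(t) = i), where pr_t[i - 1] = Pr(r(t) = i)."""
    return sum((harmonic(k) - harmonic(i - 1)) * pr_t[i - 1]
               for i in range(1, k + 1))

def upsilon_H_cumulative(pr_t, k):
    """Equivalent form: sum_i Pr(r(t) <= i) / i."""
    return sum(sum(pr_t[:i]) / i for i in range(1, k + 1))

def A(pr, tau, k):
    """A(tau) = sum_i (1/i) * sum_{t in first i entries of tau} Pr(r(t) <= i)."""
    return sum(sum(sum(pr[t][:i]) for t in tau[:i]) / i for i in range(1, k + 1))

pr = {  # hypothetical rank distributions Pr(r(t) = i), i = 1..3
    "a": [0.30, 0.00, 0.00],
    "b": [0.63, 0.27, 0.00],
    "c": [0.04, 0.36, 0.20],
    "d": [0.02, 0.18, 0.30],
}
k = 3
# tau_H: the k tuples with the highest Upsilon_H, listed in that order.
tau_H = tuple(sorted(pr, key=lambda t: upsilon_H(pr[t], k), reverse=True)[:k])
# tau_star: the exact maximizer of A, found by exhaustive search.
tau_star = max(permutations(pr, k), key=lambda tau: A(pr, tau, k))
ratio = A(pr, tau_H, k) / A(pr, tau_star, k)
```

The guarantee only promises `ratio >= 1 / harmonic(k)`; on small instances such as this one the ranking-based answer is typically much closer to the optimum.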

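Let $B(\tau) = \sum_{t\in\tau} \Upsilon_H(t)$ for any Top-k answer $\tau$. It is easy to see $A(\tau^*) \le B(\tau^*) \le B(\tau_H)$ since $\tau_H$ maximizes
the $B()$ function. Then, we can get:

$$\begin{aligned}
A(\tau_H) &= \sum_{j=1}^{k}\sum_{i=j}^{k} \frac{1}{i}\Pr(r(t_j)\le i) \\
&\ge \sum_{j=1}^{k}\Big(\frac{H_k - H_{j-1}}{H_k}\Big)\sum_{i=1}^{k} \frac{1}{i}\Pr(r(t_j)\le i) \\
&= \sum_{j=1}^{k}\Big(\frac{H_k - H_{j-1}}{H_k}\Big)\Upsilon_H(t_j) \\
&\ge \frac{1}{k}\sum_{i=1}^{k}\Big(\frac{H_k - H_{i-1}}{H_k}\Big)\sum_{i=1}^{k}\Upsilon_H(t_i) \\
&= \frac{1}{H_k} B(\tau_H) \ge \frac{1}{H_k} A(\tau^*).
\end{aligned}$$

The second inequality holds because for non-decreasing sequences $a_i$ ($1 \le i \le n$) and $c_i$ ($1 \le i \le n$),
$\sum_{i=1}^{n} a_i c_i \ge \frac{1}{n}\big(\sum_{i=1}^{n} a_i\big)\big(\sum_{i=1}^{n} c_i\big)$ (Chebyshev's sum inequality). The same bound holds when both
sequences are non-increasing, which is the case here for $(H_k - H_{j-1})/H_k$ and $\Upsilon_H(t_j)$ when $t_1, \ldots, t_k$ are
listed in non-increasing order of $\Upsilon_H$.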
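$$\begin{aligned}
E[F^*(\tau,\tau_{pw})]
&= E\Big[(k+1)|\tau\Delta\tau_{pw}| + \sum_{t\in\tau\cap\tau_{pw}} |\tau(t)-\tau_{pw}(t)| - \sum_{t\in\tau\setminus\tau_{pw}} \tau(t) - \sum_{t\in\tau_{pw}\setminus\tau} \tau_{pw}(t)\Big] \\
&= (k+1)E[|\tau\Delta\tau_{pw}|] + \sum_{t\in T} E\big[\delta(t\in\tau\cap\tau_{pw})\,|\tau(t)-\tau_{pw}(t)|\big] - \sum_{t\in\tau} E\big[\delta(t\in\tau\setminus\tau_{pw})\,\tau(t)\big] - E\Big[\sum_{t\in\tau_{pw}\setminus\tau} \tau_{pw}(t)\Big] \\
&= (k+1)E[|\tau\Delta\tau_{pw}|] + \sum_{t\in T}\sum_{i=1}^{k}\sum_{j=1}^{k} E\big[\delta(t\in\tau\cap\tau_{pw})\,\delta(t=\tau_{pw}(j))\,\delta(t=\tau(i))\,|i-j|\big] \\
&\quad - \sum_{t\in T}\sum_{i=1}^{k} E\big[\delta(t\in\tau\setminus\tau_{pw})\,\delta(t=\tau(i))\,i\big] - \sum_{t\in T\setminus\tau} \Upsilon_2(t) \\
&= (k+1)E[|\tau\Delta\tau_{pw}|] + \sum_{t\in T}\sum_{i=1}^{k}\Big(\delta(t=\tau(i))\sum_{j=1}^{k}\Pr(r(t)=j)\,|i-j|\Big) - \sum_{t\in T}\sum_{i=1}^{k}\Big(\delta(t=\tau(i))\,i\Pr(r(t)>k)\Big) - \sum_{t\in T\setminus\tau} \Upsilon_2(t) \\
&= (k+1)\Big(k + \sum_{t\in T}\Upsilon_1(t) - 2\sum_{t\in\tau}\Upsilon_1(t)\Big) + \sum_{t\in T}\sum_{i=1}^{k}\delta(t=\tau(i))\,\Upsilon_3(t,i) - \sum_{t\in T\setminus\tau}\Upsilon_2(t) \\
&= (k+1)k + \sum_{t\in T}\big((k+1)\Upsilon_1(t) - \Upsilon_2(t)\big) + \sum_{t\in T}\sum_{i=1}^{k}\delta(t=\tau(i))\big(\Upsilon_3(t,i) + \Upsilon_2(t) - 2(k+1)\Upsilon_1(t)\big)
\end{aligned}$$

Figure 2: Derivation for Spearman's Footrule Distance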

5.4 Spearman's Footrule 

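For a Top-k answer $\tau = \{\tau(1), \tau(2), \ldots, \tau(k)\}$, we define:

• $\Upsilon_1(t) = \sum_{i=1}^{k} \Pr(r(t) = i)$

• $\Upsilon_2(t) = \sum_{i=1}^{k} \Pr(r(t) = i) \cdot i$

• $\Upsilon_3(t, i) = \sum_{j=1}^{k} \Pr(r(t) = j)\,|i - j| - i\Pr(r(t) > k)$.

The last term of $\Upsilon_3$ carries a minus sign, matching the term $-\sum_{t\in\tau\setminus\tau_{pw}} \tau(t)$ in the closed form of the footrule distance.
It is easy to see $\Upsilon_1(t)$, $\Upsilon_2(t)$, $\Upsilon_3(t, i)$ can be computed in polynomial time for a probabilistic and/xor tree using our
generating functions method.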

A careful and non-trivial rewriting of $E_{pw\in PW}[F^*(\tau, \tau_{pw})]$ shows that it also has the form:

$$E_{pw\in PW}[F^*(\tau,\tau_{pw})] = C + \sum_{t\in T}\sum_{i=1}^{k} \delta(t = \tau(i))\, f(t, i)$$

where $C$ is a constant independent of $\tau$, and $f(t, i)$ is a function of $t$ and $i$, which is polynomially computable. Figure
2 shows the exact derivation.

Thus, we only need to minimize the second term, which can be modeled as the assignment problem and can be 
solved in polynomial time. 
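The decomposition into a constant plus per-position costs can be verified by brute force. The sketch below is an illustration under stated assumptions: tuples are independent with distinct scores, possible worlds are enumerated explicitly, worlds with fewer than $k$ tuples simply give shorter lists to which the closed form of $F^*$ is applied as is, and $\Upsilon_3$ is taken with a minus sign on the $i\Pr(r(t) > k)$ term, as in the derivation of Figure 2. All function names and the example relation are hypothetical.

```python
from itertools import permutations, product

def worlds(tuples, k):
    """Yield (probability, Top-k list) for {name: (score, existence prob)}."""
    names = sorted(tuples, key=lambda t: -tuples[t][0])
    for mask in product([0, 1], repeat=len(names)):
        prob = 1.0
        for bit, t in zip(mask, names):
            prob *= tuples[t][1] if bit else 1.0 - tuples[t][1]
        yield prob, [t for bit, t in zip(mask, names) if bit][:k]

def footrule(tau, tau_pw, k):
    """Closed form of F* for two lists; positions are 1-based."""
    pos = {t: i for i, t in enumerate(tau, 1)}
    pos_pw = {t: i for i, t in enumerate(tau_pw, 1)}
    both = pos.keys() & pos_pw.keys()
    return ((k + 1) * len(pos.keys() ^ pos_pw.keys())
            + sum(abs(pos[t] - pos_pw[t]) for t in both)
            - sum(i for t, i in pos.items() if t not in pos_pw)
            - sum(i for t, i in pos_pw.items() if t not in pos))

def rank_distribution(tuples, k):
    """pr[t][j - 1] = Pr(r(t) = j) for j = 1..k."""
    pr = {t: [0.0] * k for t in tuples}
    for prob, top in worlds(tuples, k):
        for j, t in enumerate(top):
            pr[t][j] += prob
    return pr

def f(pr_t, i, k):
    """Per-position cost: Upsilon_3(t, i) + Upsilon_2(t) - 2(k+1) Upsilon_1(t)."""
    u1 = sum(pr_t)
    u2 = sum(j * p for j, p in enumerate(pr_t, 1))
    u3 = sum(p * abs(i - j) for j, p in enumerate(pr_t, 1)) - i * (1.0 - u1)
    return u3 + u2 - 2 * (k + 1) * u1

def constant(pr, k):
    """C = (k+1)k + sum_t ((k+1) Upsilon_1(t) - Upsilon_2(t))."""
    return (k + 1) * k + sum(
        (k + 1) * sum(p) - sum(j * q for j, q in enumerate(p, 1))
        for p in pr.values())

tuples = {"a": (90, 0.3), "b": (80, 0.9), "c": (70, 0.6), "d": (60, 0.8)}
k = 2
pr = rank_distribution(tuples, k)
C = constant(pr, k)
# Exhaustive search stands in for a polynomial-time assignment solver.
mean_tau = min(permutations(tuples, k),
               key=lambda tau: sum(f(pr[t], i, k) for i, t in enumerate(tau, 1)))
```

Because only the second term depends on $\tau$, minimizing the summed `f` values over assignments is the same as minimizing the expected footrule distance.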



5.5 Kendall's Tau Distance 

The Kendall's tau distance (also called Kemeny distance) $d_K$ between two Top-k lists $\tau_1$ and $\tau_2$ is defined to be the
number of unordered pairs $(t_i, t_j)$ such that the order of $i$ and $j$ disagree in any full rankings extended from $\tau_1$ and
$\tau_2$, respectively. It is shown that $d_F$ and $d_K$ and a few other generalizations of Spearman's footrule and Kendall's tau
metrics form a big equivalence class, i.e., they are within a constant factor of each other [16]. Therefore, the optimal
solution for $d_F$ implies constant approximations for all metrics in this class (the constant for $d_K$ is 2).

However, we can also easily obtain a 3/2-approximation for $d_K$ by extending the 3/2-approximation for the partial
rank aggregation problem due to Ailon [1]. The only information used in their algorithm is the proportion of lists
where $t_i$ is ranked higher than $t_j$ for all $i, j$. In our case, this corresponds to $\Pr(r(t_i) < r(t_j))$. This can be easily
computed in polynomial time using the generating functions method.

We also note that the problem of optimally computing the mean answer is NP-hard for probabilistic and/xor trees.
This follows from the fact that probabilistic and/xor trees can simulate arbitrary possible worlds, and previous work
has shown that aggregating even 4 rankings under this distance metric is NP-hard [14].



6 Other Types of Queries 

We briefly extend the notion of consensus answers to two other types of queries and present some initial results. 

6.1 Aggregate Queries 

Consider a query of the type: 

select groupname, count(*) from R group by groupname

Suppose there are $m$ potential groups (indexed by groupname) and $n$ independent tuples with attribute uncertainty.
The probabilistic database can be specified by the matrix $P = [p_{i,j}]_{n\times m}$ where $p_{i,j}$ is the probability that tuple $i$ takes
groupname $j$ and $\sum_{j=1}^{m} p_{i,j} = 1$ for any $1 \le i \le n$. A query result (on a deterministic relation) is an $m$-dimensional
vector $r$ where the $i$-th entry is the number of tuples having groupname $i$. The natural distance metric to use is the
squared vector distance.

Computing the mean answer is easy in this case, because of linearity of expectation: we simply take the mean for
each aggregate separately, i.e., $\bar{r} = \mathbf{1}P$ where $\mathbf{1} = (1, 1, \ldots, 1)$. We note the mean answer minimizes the expected
squared vector distance to any possible answer.
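As a small worked example, the mean answer is just the vector of column sums of $P$. The sketch below uses a hypothetical 3-tuple, 3-group matrix and cross-checks the column sums against the expected counts obtained by enumerating every possible world.

```python
from itertools import product

def mean_answer(P):
    """r_bar = 1P: the column sums of the n x m probability matrix P."""
    return [sum(col) for col in zip(*P)]

def expected_counts_brute_force(P):
    """Expected group counts, enumerating all m^n group assignments."""
    n, m = len(P), len(P[0])
    exp = [0.0] * m
    for choice in product(range(m), repeat=n):   # choice[i] = group of tuple i
        prob = 1.0
        for i, g in enumerate(choice):
            prob *= P[i][g]
        for g in choice:
            exp[g] += prob
    return exp

# Hypothetical instance: row i is the group distribution of tuple i.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 1.0, 0.0]]
r_bar = mean_answer(P)
```

The entries of `r_bar` are generally fractional, which is why the mean answer need not be a possible answer.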

The median world requires that the returned answer be a possible answer. It is not clear how to solve this problem
optimally in polynomial time. Enumerating all worlds is obviously not computationally feasible. Rounding the entries of
$\bar{r}$ to the nearest integers may not result in a possible answer.

Next we present a polynomial time algorithm to find a closest possible answer to the mean world $\bar{r}$. This yields a
4-approximation for finding the median answer. We can model the problem as follows: Consider the bipartite graph
$B(U, V, E)$ where each node in $U$ is a tuple, each node in $V$ is a groupname, and an edge $(u, v)$, $u \in U$, $v \in V$,
indicates that tuple $u$ takes groupname $v$ with non-zero probability. We call a subgraph $G'$ such that $\deg_{G'}(u) = 1$
for all $u \in U$ and $\deg_{G'}(v) = r[v]$ an $r$-matching of $B$ for some $m$-dimensional integral vector $r$. Given this, our
objective is to find an $r$-matching of $B$ such that $\|r - \bar{r}\|_2^2$ is minimized. Before presenting the main algorithm, we
need the following lemma.

Lemma 3 The possible world $r^*$ that is closest to $\bar{r}$ is of the following form: $r^*[i]$ is either $\lfloor \bar{r}[i] \rfloor$ or $\lceil \bar{r}[i] \rceil$ for each
$1 \le i \le m$.

Proof: Let $M^*$ be the corresponding $r^*$-matching. Suppose the lemma is not true, and there exists $i$ such that
$|r^*[i] - \bar{r}[i]| \ge 1$. W.l.o.g., we assume $r^*[i] > \bar{r}[i]$. The other case can be proved the same way. Consider the
connected component $K = \{U', V', E(U', V')\}$ containing $i$. We claim that there exists $j \in V'$ such that $r^*[j] < \bar{r}[j]$
and there is an alternating path $P$ with respect to $M^*$ connecting $i$ and $j$. Therefore, $M' = M^* \oplus P$ is also a valid
matching. Suppose $M'$ is an $r'$-matching. But:

$$\begin{aligned}
\|r' - \bar{r}\|_2^2 &= \sum_{v=1}^{m} (r'[v] - \bar{r}[v])^2 \\
&= \sum_{v=1}^{m} (r^*[v] - \bar{r}[v])^2 - (r^*[i] - \bar{r}[i])^2 - (r^*[j] - \bar{r}[j])^2 + (r'[i] - \bar{r}[i])^2 + (r'[j] - \bar{r}[j])^2 \\
&= \|r^* - \bar{r}\|_2^2 - (r^*[i] - \bar{r}[i])^2 - (r^*[j] - \bar{r}[j])^2 + (r^*[i] - 1 - \bar{r}[i])^2 + (r^*[j] + 1 - \bar{r}[j])^2 \\
&= \|r^* - \bar{r}\|_2^2 + 2 - 2r^*[i] + 2\bar{r}[i] + 2r^*[j] - 2\bar{r}[j] \\
&< \|r^* - \bar{r}\|_2^2,
\end{aligned}$$

where the last step uses $r^*[i] - \bar{r}[i] \ge 1$ and $r^*[j] < \bar{r}[j]$.
This contradicts the assumption that $r^*$ is the vector closest to $\bar{r}$.

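Now, we prove the claim. We grow an alternating path (w.r.t. $M^*$) tree rooted at $i$ in a BFS manner: at odd depth,
we extend all edges in $M^*$ and at even depth, we extend all edges not in $M^*$. Let $O \subseteq V$ be the set of nodes at odd
depth ($i$ is at depth 1) and $E \subseteq U$ the set of nodes at even depth. It is easy to see $N_B(E) = O$, $E \subseteq N_B(O)$ and
$\sum_{v\in O} r^*[v] = |E|$. Suppose $r^*[v] \ge \bar{r}[v]$ for all $v \in O$ and $r^*[i] > \bar{r}[i]$. However, the contradiction follows since:

$$|E| = \sum_{v\in O} r^*[v] > \sum_{v\in O} \bar{r}[v] = \sum_{v\in O}\sum_{u\in N_B(O)} p[u,v] \ge |E|,$$

where the last step holds because every $u \in E$ has all of its neighbors in $O$ and therefore contributes total probability 1.

□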

With Lemma 3 at hand, we can construct the following min-cost network flow instance to compute the vector $r^*$
closest to $\bar{r}$. Add to $B$ a source $s$ and a sink $t$. Add edges $(s, u)$ with capacity upper bound 1 for all $u \in U$. For each
$v \in V$ such that $\bar{r}[v]$ is not an integer, add two edges $e_1(v, t)$ and $e_2(v, t)$. $e_1(v, t)$ has both lower and upper bound of capacity
$\lfloor \bar{r}[v] \rfloor$, and $e_2(v, t)$ has capacity upper bound 1 and cost $(\lceil \bar{r}[v] \rceil - \bar{r}[v])^2 - (\lfloor \bar{r}[v] \rfloor - \bar{r}[v])^2$. If $\bar{r}[v]$ is an integer, we only
add $e_1(v, t)$. We find a min-cost integral flow of value $n$ on this network. For any $v$ such that $e_2(v, t)$ is saturated, we
set $r^*[v]$ to be $\lceil \bar{r}[v] \rceil$, and $\lfloor \bar{r}[v] \rfloor$ otherwise. Such a flow with minimum cost suggests the optimality of the vector $r^*$ due to
Lemma 3.

Theorem 5 There is a polynomial time algorithm for finding the vector $r^*$ closest to $\bar{r}$ such that $r^*$ corresponds to some
possible answer with non-zero probability.
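The floor/ceiling structure of the closest possible answer can be observed on a toy instance. The sketch below is not the min-cost flow construction: it enumerates every possible world by brute force, keeps the count vector nearest to the mean answer, and then checks the property stated in Lemma 3. The matrix `P` and the function name are hypothetical.

```python
from itertools import product
from math import ceil, floor

def closest_possible_answer(P):
    """Return (r_bar, r_star) by exhaustive search over possible worlds."""
    n, m = len(P), len(P[0])
    r_bar = [sum(col) for col in zip(*P)]
    best, best_dist = None, float("inf")
    for choice in product(range(m), repeat=n):   # choice[i] = group of tuple i
        if any(P[i][g] == 0 for i, g in enumerate(choice)):
            continue                             # zero-probability world
        r = [choice.count(g) for g in range(m)]
        dist = sum((a - b) ** 2 for a, b in zip(r, r_bar))
        if dist < best_dist:
            best, best_dist = r, dist
    return r_bar, best

# Hypothetical instance: row i is the group distribution of tuple i.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 1.0, 0.0]]
r_bar, r_star = closest_possible_answer(P)
```

The exhaustive search takes time exponential in the number of tuples, which is exactly what the flow formulation avoids.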

Finally, we can prove that: 

Corollary 2 There is a polynomial time deterministic 4-approximation for finding the median aggregate answer. 

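Proof: Suppose $r^*$ is the answer closest to the mean answer $\bar{r}$ and $r_m$ is the median answer. Let $r$ be the vector
corresponding to the random answer. Then:

$$\begin{aligned}
E[d(r^*, r)] &\le E[2(d(r^*, \bar{r}) + d(\bar{r}, r))] = 2(d(r^*, \bar{r}) + E[d(\bar{r}, r)]) \\
&\le 4E[d(\bar{r}, r)] \le 4E[d(r_m, r)].
\end{aligned}$$

The second inequality uses $d(r^*, \bar{r}) \le d(r, \bar{r})$ for every possible answer $r$, and the last one uses the fact that the mean
answer minimizes the expected squared distance.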



6.2 Clustering 

The Consensus-Clustering problem is defined as follows: given $k$ clusterings $C_1, \ldots, C_k$ of $V$, find a clustering
$C$ that minimizes $\sum_{i=1}^{k} d(C, C_i)$. In the setting of probabilistic databases, the given clusterings are the clusterings in
the possible worlds, weighted by the existence probability. The main problem with extending the notion of consensus
answers to clustering is that the input clusterings are not well-defined (unlike ranking where the score function defines
the ranking in any world). We consider a somewhat simplified version of the problem, where we assume that two
tuples $t_i$ and $t_j$ are clustered together in a possible world, if and only if they take the same value for the value attribute
$A$ (which is uncertain). Thus, a possible world $pw$ uniquely determines a clustering $C_{pw}$. We define the distance
between two clusterings $C_1$ and $C_2$ to be the number of unordered pairs of tuples that are clustered together in $C_1$, but
separated in the other (the Consensus-Clustering metric). To deal with nonexistent keys in a possible world, we
artificially create a cluster containing all of those.

Our task is to find a mean clustering $C$ such that $E[d(C, C_{pw})]$ is minimized. An approximation with a factor of 4/3 is known for
Consensus-Clustering [2], and can be adapted to our problem in a straightforward manner. In fact, that
approximation algorithm simply needs $w_{t_i,t_j}$ for all $t_i, t_j$, where $w_{t_i,t_j}$ is the fraction of input clusterings that cluster $t_i$ and $t_j$
together, and can be computed as: $w_{t_i,t_j} = \sum_{a\in A} \Pr(i.A = a \wedge j.A = a)$.

To compute these quantities given an and/xor tree, we associate a variable $x$ with all leaves with value $(i, a)$ and
$(j, a)$, and constant 1 with the other leaves. From Theorem 1, $\Pr(i.A = a \wedge j.A = a)$ is simply the coefficient of $x^2$
in the corresponding generating function.
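For intuition, the coefficient extraction can be written out for the simplest case. The sketch below assumes two independent tuples whose attribute distributions each sum to 1 (a two-level and/xor tree, which is an assumption of this example and not the general construction); for each value $a$ it multiplies the two per-tuple polynomials and reads off the coefficient of $x^2$. The helper names and the example distributions are hypothetical.

```python
def poly_mul(f, g):
    """Multiply two polynomials given as coefficient lists (index = degree)."""
    out = [0.0] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

def co_clustering_weight(p_i, p_j):
    """w = sum_a Pr(i.A = a and j.A = a) for two independent tuples.

    p_i and p_j map attribute value -> probability.
    """
    w = 0.0
    for a in set(p_i) & set(p_j):
        # Leaves (i, a) and (j, a) carry x; every other leaf carries 1, so
        # each tuple contributes the polynomial (1 - p) + p * x.
        gf = poly_mul([1.0 - p_i[a], p_i[a]], [1.0 - p_j[a], p_j[a]])
        w += gf[2]      # coefficient of x^2
    return w

p_i = {"red": 0.6, "blue": 0.4}
p_j = {"red": 0.5, "green": 0.5}
w = co_clustering_weight(p_i, p_j)
```

With correlated tuples the per-value generating function no longer factors this way and has to be built bottom-up over the tree.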
7 Conclusion



We addressed the problem of finding a single representative answer to a query over probabilistic databases by
generalizing the notion of inconsistent information integration. We believe this approach provides a systematic and formal
way to reason about the semantics of probabilistic query answers, especially for Top-k queries. Our initial work has
opened up many interesting avenues for future work. These include the design of efficient exact and approximate
algorithms for finding consensus answers for other types of queries, exploring connections to safe plans, and understanding
the semantics of the other previously proposed ranking functions using this framework.